ABSTRACT


“Hackers” have written malicious programs to exploit on- 
line services intended for human users. As a result, service 
providers need a method to tell whether a web site is being 
accessed by a human or a machine. We expect a parallel 
scenario as spoken language interfaces become common. 

In this paper, we describe a Reverse Turing Test (i.e., an 
algorithm that can distinguish between humans and com- 
puters) using speech. We present a test that depends on the 
fact that human recognition of distorted speech is far more 
robust than automatic speech recognition techniques. 

Our analysis of 18 different sets of distortions demon- 
strates that there are a variety of ways to make the problem 
hard for machines. In addition, humans and speech recog- 
nition systems make different kinds of mistakes, and this 
difference can be employed to improve discrimination. 


1. INTRODUCTION 


The use of the Internet as a means for distributing valuable
information and content has also made it an attractive target
for “hackers.” Attacks involving malicious programs 
(“bots”) that attempt to exploit online services intended for 
human users are already common. These programs con- 
sume resources, harass users, make attempts to guess pass- 
words, steal and re-purpose copyrighted content, and invade 
privacy by reconstructing sensitive data from public views. 

As a result, there is a need for automatic methods to tell 
whether the entity attempting to access a service is a human 
or a machine. This has come to be known as a Reverse Tur- 
ing Test, or RTT (or sometimes a Human Interactive Proof). 
Unlike the test originally proposed by Alan Turing [1], a Re- 
verse Turing Test is administered by a computer, not a hu- 
man. For the test to be considered effective, humans should 
be able to pass it with ease, but machines should have a low 
probability of passing. 

While it may seem that passwords and/or biometrics 
could provide a solution, one must keep in mind that these 
approaches require pre-registration. An RTT will work even 
if the user is anonymous and has never used the service before.
The RTT problem is fundamentally different from validating
a known user. Indeed, a Reverse Turing Test should be applied
before authentication to prevent automated attacks on passwords.

Coates, Baird, and Fateman [2], and von Ahn et al. [3] 
have developed such a test based on a visual perception task. 
Their ideas are based on the observation that optical char- 
acter recognition (OCR) systems are not as adept at reading 
degraded word images as humans are. That RTT is now 
used commercially to protect a free email service [4]. 

Speech-based services are proliferating because of their 
ease-of-use, portability, and potential for hands-free oper- 
ation. Building a “bot” to navigate a spoken language in- 
terface is a tractable problem, especially if there is a fixed 
sequence of predefined prompts. Hence, we anticipate at- 
tacks on such systems and a similar need to prevent ma- 
chines from abusing speech-based resources intended for 
human users. Previous work on speech RTTs can be found 
in [5, 6]. 

As in the vision RTTs, we exploit the fact that certain 
pattern recognition tasks are significantly harder for ma- 
chines than they are for humans. We will use text-to-speech 
synthesis (TTS) to generate tests, and make use of the limitations
of state-of-the-art automatic speech recognition (ASR) 
technology. (We require only that RTTs cannot be broken 
cheaply or rapidly. Clearly, any RTT can be broken by hir- 
ing a human.) 

In this paper, we present the core of a spoken language 
RTT. We assume a user with a cell phone; the test may con- 
sist of having the system speak: “Please enter the following 
digits on your keypad: ...” followed by a short, random 
digit string. The speech would be synthesized in a way that 
ASR is likely to fail the test, e.g., by distorting the signal or 
adding “difficult” background noise to it after synthesis. 
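
The challenge-and-response flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the five-digit default is borrowed from the experiments in Section 2, and the synthesis and distortion steps are left out entirely.

```python
import secrets

def make_challenge(num_digits=5):
    """Return a random digit string to be spoken to the caller."""
    return "".join(secrets.choice("0123456789") for _ in range(num_digits))

def build_prompt(digits):
    """Text that would be handed to a TTS engine before any distortion."""
    return "Please enter the following digits on your keypad: " + " ".join(digits)

def passes(challenge, keyed_response):
    """Accept only an exact match between the keyed digits and the challenge."""
    return keyed_response.strip() == challenge
```

An exact-match rule is the simplest acceptance policy; Section 4 suggests that the kinds of errors made could also be taken into account.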


2. PROCEDURES 


We designed a set of 18 RTTs based on different distortions
of a speech signal (Table 1) to explore a broad range of
possibilities. To test our RTTs, we synthesized 200 random
5-digit sequences using the Bell Labs English text-to-speech 
system [7], with the default male voice. We next distorted 
the signals and ran the Bell Labs speech recognition system [8]
on them, with a grammar that allowed any digit sequence.


Name     Description                                         Error Ratio
white    White Gaussian noise, 4000 Hz bandwidth.            15
buzz     Sine waves at 700 Hz, 2100 Hz, 3500 Hz.             > 20
song     Bell Labs Song (pop/rock).                          > 15
chopin   Chopin Polonaise for Piano No. 6, Op. 53.           > 20
chant    Gregorian chant.                                    > 20
female   Three overlapping instances of a female voice       > 20
         reading numbers.
pow      10 ms bursts of white Gaussian noise, repeated      > 20
         every 100 ms.
rnoise   Every 100 ms, a section of the signal is replaced   > 20
         by white noise of the same RMS amplitude.
cell     For each 30 ms window, decide if the data was       > 10
         lost. If so, and previous not lost, duplicate
         previous. If so, and previous is lost, set to
         zero. Simulates a bad cell phone channel.
echo     Three echoes.                                       > 20
filter   A random zero-phase filter.                         > 5
distort  Apply AGC on a 60 ms window, raise to a power,      > 20
         multiply by original amplitude.
mxa      rnoise + chopin                                     > 20
mxb      song + echo                                         > 20
mxc      white + pow                                         > 20
mxd      female + buzz                                       > 5
mxe      rnoise + distort                                    > 20
mxf      filter + distort                                    3


Table 1. Tested RTTs. The error ratio is an estimate of
the largest ratio of the ASR to human utterance error rates.
The components of all mixtures were chosen to cause equal
error rates for ASR utterance error rates near 85%.
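
As an illustration of how one of these distortions might be implemented, the following Python sketch follows the description of the cell distortion in Table 1. The loss probability is a hypothetical control parameter (the text does not say how the loss decision is made), and treating a lost first window as silence is likewise an assumption.

```python
import random

def simulate_cell_loss(signal, sample_rate, loss_prob, window_ms=30, rng=None):
    """Drop 30 ms windows at random, in the manner of a bad cell phone channel.

    loss_prob is an assumed parameter: the chance that any one window is lost.
    """
    rng = rng or random.Random()
    win = int(sample_rate * window_ms / 1000)
    out = list(signal)
    prev_lost = False
    for start in range(0, len(out), win):
        end = min(start + win, len(out))
        lost = rng.random() < loss_prob
        if lost and start > 0 and not prev_lost:
            # Previous window survived: repeat it in place of the lost one.
            for k in range(start, end):
                out[k] = out[k - win]
        elif lost:
            # Previous window was also lost (or there is none): output silence.
            for k in range(start, end):
                out[k] = 0.0
        prev_lost = lost
    return out
```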

The recognition results were compared to the original 
digit sequences using approximate string matching ([9] and 
references therein) to identify added, missing, and substi- 
tuted digits. From this we obtained, for each distortion, per- 
digit and per-utterance error rates, as well as a confusion 
matrix showing the frequency of each type of error. 
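
A minimal sketch of such an alignment is given below. It uses a standard unit-cost edit distance with a traceback, which is one common form of approximate string matching; the exact variant used in our experiments is not specified here, so the cost choices and the function name are assumptions.

```python
def align_counts(reference, hypothesis):
    """Count correct, substituted, missing, and added digits.

    Uses a unit-cost edit-distance table followed by a traceback.
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + diff,  # match or substitution
                             cost[i - 1][j] + 1,         # digit missing
                             cost[i][j - 1] + 1)         # digit added
    counts = {"correct": 0, "substituted": 0, "missing": 0, "added": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + diff:
                counts["correct" if diff == 0 else "substituted"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["missing"] += 1
            i -= 1
        else:
            counts["added"] += 1
            j -= 1
    return counts
```

Recording which reference digit was paired with which recognized digit during the traceback, rather than only counting, would yield the entries of the confusion matrix.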

A similar procedure was followed to test how well hu- 
mans could recognize the signals. The authors each lis- 
tened to 10 randomly-chosen digit sequences for each type 
of distortion. In these tests, the audio signals were presented 
twice with a one second pause in between. The subject then 
typed his/her interpretation and pressed return to listen to 
the next signal. The tests began at the most severe distor- 
tion of each type, and terminated when the subject correctly 
identified all 10 sequences. One set (white) received 67 rep- 
etitions per person for a more accurate confusion matrix. 

The confusion matrices are scaled to make all diagonal
confusion matrices identical. This step is necessary because
the distance between raw confusion matrices is generally
nonzero, even if there are no errors (i.e., the matrices are
diagonal). This happens because we do not explicitly balance
the frequencies of each digit, so that one test may see more
instances of, say, “3” than another. Hence, we use a scaled
confusion matrix, $S = P \cdot C \cdot Q$, where $P$ and $Q$ are
diagonal matrices chosen so that the row- and column-sums of $S$
are unity. While deletions and insertions are conventionally
placed in the same matrix as substitutions, they are actually a
fundamentally different kind of error. Specifically, $C_{\mathrm{del,ins}}$
is on the diagonal and is missing (zero). Thus, insertions
and deletions must be treated differently in the scaling. We
make an ad hoc modification to $C$ before scaling: we set
$C_{\mathrm{del,ins}} = \sum_{ij} C_{ij} / N$ (where $N = 11$ is the number of
rows and columns in $C$), and then set $S_{\mathrm{del,ins}} = 0$.
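
One way to find diagonal scalings that give unit row and column sums is to normalize rows and columns alternately (the Sinkhorn-Knopp iteration). The text does not say how the scaling matrices were computed, so the sketch below is an assumption, and it omits the special handling of the deletion/insertion entry.

```python
def scale_confusion(counts, iterations=200):
    """Alternately normalize rows and columns so that both sum to one.

    The returned matrix equals the input with each row and each column
    multiplied by a positive factor, i.e. a diagonal scaling on both sides.
    Assumes a square matrix with enough nonzero entries to converge.
    """
    s = [[float(v) for v in row] for row in counts]
    n = len(s)
    for _ in range(iterations):
        for i in range(n):                       # row step
            total = sum(s[i])
            if total > 0:
                s[i] = [v / total for v in s[i]]
        for j in range(n):                       # column step
            total = sum(s[i][j] for i in range(n))
            if total > 0:
                for i in range(n):
                    s[i][j] /= total
    return s
```

With this scaling, any two error-free (diagonal) count matrices map to the identity regardless of how often each digit occurred, which is the property the scaling is meant to provide.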


3. ANALYSIS: IS IT POSSIBLE TO DISTINGUISH 
HUMANS FROM MACHINES? 


Human perception of speech in noisy environments is fairly 
robust. Normal-hearing listeners need a signal-to-noise ra- 
tio (SNR) of approximately 1.5 dB to recognize speech [10], 
while ASR systems require a much more favorable SNR of 
5 to 15 dB [11]. Our test results show an even wider gap 
between human and ASR performance. 

Figures 1-4 plot the error rates we measured in four ex- 
periments; these correspond to four distinct types of distor- 
tion: additive noise, deleting segments of the speech, adding 
echoes and filtering the signal. The four curves in each chart 
represent the error rates for ASR on a per-utterance (square) 
and per-symbol (diamond) basis along with per-utterance 
(X) and per-symbol (triangle) rates for humans. 
Fig. 1. Results for white noise experiment (white). 


In Figure 1, white is simply additive white noise. The 
x-axis shows SNR, and the y-axis represents the rate of 
recognition errors. In the ASR curves, each datum repre- 
sents scores from 200 utterances or 1,000 digits, while in 
the human curves, each datum represents 201 utterances or 
1,005 digits as pooled from the three listeners. ASR per- 
formance starts deteriorating when the SNR reaches 15 dB, 
and breaks down completely by 3 dB. Human performance 
is only starting to deteriorate at 0 dB SNR. 
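
For concreteness, additive white Gaussian noise at a chosen SNR can be generated as in the following sketch. The function name is hypothetical, SNR is taken as the ratio of average signal power to average noise power over the whole utterance (an assumption), and no band-limiting is applied.

```python
import math
import random

def add_white_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR (dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]
```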
Fig. 2. Results for replacement noise experiment (rnoise). 


Figure 2 shows the results for rnoise, which replaces a 
segment of speech with white noise every 100 ms, as we 
vary the length of the replaced segment. Human scores in 
this figure and in Figures 3-4 are based on 30 utterances 
(150 digits) per datum. As can be seen, ASR starts having 
problems when 2 ms (2%) of the speech is replaced. Amaz- 
ingly, human recognition remains perfect at 60 ms (60% re- 
placement), and 80% of the symbols are still correct even 
when 80% of the speech is missing. 
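
The replacement procedure can be sketched as follows. Two details are assumptions, since the description leaves them open: the RMS amplitude is measured over the whole utterance, and the replaced segment is placed at the start of each 100 ms period.

```python
import math
import random

def replace_with_noise(signal, sample_rate, replaced_ms, period_ms=100, rng=None):
    """Every period_ms, overwrite replaced_ms of signal with white noise."""
    rng = rng or random.Random()
    # Noise level matches the RMS amplitude of the (non-empty) input signal.
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    period = int(sample_rate * period_ms / 1000)
    span = int(sample_rate * replaced_ms / 1000)
    out = list(signal)
    for start in range(0, len(out), period):
        for k in range(start, min(start + span, len(out))):
            out[k] = rng.gauss(0.0, rms)
    return out
```

With replaced_ms set to 60, for example, 60% of every 100 ms period carries no speech at all.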


[Figure 3 plot: error rate versus signal-to-noise ratio, with curves for ASR utterance, ASR symbol, human utterance, and human symbol.]


Fig. 3. Results for echo experiment (echo). 


Our results for echo, which adds three echoes to the 
speech with delays of 60, 132, and 192 ms, are presented 
in Figure 3. The horizontal axis shows the relative ampli- 
tude of the first echo to the speech, while later echoes are 
5 dB and 10 dB quieter. ASR performance starts declining 
when the SNR is 12 dB, while human performance is perfect
until -3 dB. This behavior is typical of many of the 
other tests we performed. 
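
A sketch of the echo distortion is given below. The delays are those stated above; the parameter snr_db (level of the direct speech relative to the first echo) and the reading that the later echoes are 5 dB and 10 dB quieter than the first echo are assumptions made for illustration.

```python
def add_echoes(signal, sample_rate, snr_db,
               delays_ms=(60, 132, 192), extra_atten_db=(0, 5, 10)):
    """Mix three delayed, attenuated copies of the signal into itself."""
    tail = int(sample_rate * max(delays_ms) / 1000)
    out = list(signal) + [0.0] * tail          # room for the last echo
    for delay_ms, atten_db in zip(delays_ms, extra_atten_db):
        # Amplitude gain of this echo relative to the direct speech.
        gain = 10 ** (-(snr_db + atten_db) / 20)
        offset = int(sample_rate * delay_ms / 1000)
        for k, x in enumerate(signal):
            out[offset + k] += gain * x
    return out
```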

Among our tests, filter (Figure 4) showed the smallest 
difference between ASR and humans. This is a zero-phase 
frequency domain filter, with a frequency response chosen 
randomly every 30 Hz. The control parameter sets the stan- 
dard deviation of the gain, expressed in dB. 

Fig. 4. Results for random filter experiment (filter).


We find that the gap between ASR and human performance
appears to be wide enough to administer a robust Reverse
Turing Test. In general, our tests show that humans can
handle noise levels about 15 dB higher than ASR 
can. Likewise, humans can understand digit strings when 
more than half of the signal is missing, a point at which the 
machine already has a 100% error rate for utterances. 


4. ANALYSIS: DO HUMANS MAKE THE SAME 
KINDS OF ERRORS AS MACHINES? 


Not only is the error rate larger for an ASR system than for 
humans, but the pattern of errors is significantly different as 
well. Figure 5 shows a gray-scale plot of errors made by 
the machine and humans on the white data set. The figures 
represent noise levels chosen to give matching error rates. 
Fig. 5. Scaled confusion matrices at 35% error rate for humans
(-3dB SNR, left) and machine (+10dB SNR, right).


Based on this observation, we could improve an RTT by 
looking at the kinds of errors that are made. For instance, 
humans readily confuse the digits “2” and “3” in the presence
of white noise, so such an error should not be considered
evidence that a machine is attempting to access the system.
On the other hand, humans never mistook an “8” for a “2”
in our tests, whereas the ASR system sometimes did, so such
an error would suggest a machine's presence.
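
One way to act on this observation is to score each response with a log-likelihood ratio built from the two confusion matrices. The sketch below is hypothetical: the probabilities are invented for illustration and are not measured values, and the floor that keeps never-observed errors from producing infinite scores is an arbitrary choice.

```python
import math

# Illustrative P(keyed digit | spoken digit); these numbers are made up.
HUMAN_CONF = {("2", "2"): 0.85, ("2", "3"): 0.10,
              ("8", "8"): 0.95, ("8", "2"): 0.0}
MACHINE_CONF = {("2", "2"): 0.85, ("2", "3"): 0.02,
                ("8", "8"): 0.85, ("8", "2"): 0.08}

def log_likelihood_ratio(pairs, human_conf, machine_conf, floor=1e-3):
    """Sum of log P(keyed | spoken, human) / P(keyed | spoken, machine).

    Positive totals favor a human respondent, negative totals a machine.
    """
    total = 0.0
    for spoken, keyed in pairs:
        p_human = max(human_conf.get((spoken, keyed), 0.0), floor)
        p_machine = max(machine_conf.get((spoken, keyed), 0.0), floor)
        total += math.log(p_human / p_machine)
    return total
```

Under these made-up numbers, keying “3” for a spoken “2” raises the score, while keying “2” for a spoken “8” lowers it sharply.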


5. ANALYSIS: HOW MANY DISTINCT KINDS OF 
DISTORTION ARE THERE? 


One obvious attack on this kind of RTT is to build a classi- 
fier that, working from the input signal, attempts to identify 
the distortion that was applied, and then sends the input to 
an ASR system trained specifically for that distortion. ASR 
can be tuned to work well in the presence of noise [12], but 
it first needs to be trained with a large representative corpus 
collected from the environment in question. 


Assuming that it is possible to build the necessary clas- 
sifier, the question then arises “How many different ASR 
systems would one need to train in order to break the RTT?” 
To make this question tractable, we approach it by examin- 
ing the confusion matrices derived from the experiments we 
have performed; these will serve as a proxy for training and 
testing a large number of ASR systems. 


To do this, we follow logic presented elsewhere in the 
context of OCR systems [13]. We assume that if the ASR 
confusion matrices corresponding to two distorted signals 
are different enough, then separate recognizers will be needed 
because the two signals are fundamentally different. To 
make a quantitative comparison, we need to define a distance
measure: we use the 2-norm of the difference of the scaled
confusion matrices, $D(\alpha, \beta)^2 = \sum_{ij} (S_{ij}(\alpha) - S_{ij}(\beta))^2$,
where $\alpha$ and $\beta$ refer to two different distortions,
and $S$ is a function of how the speech is distorted.
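
The distance is simply the square root of the summed squared differences of corresponding entries, as in this short sketch (the function name is hypothetical, and both matrices are assumed to be scaled already and of equal size).

```python
import math

def confusion_distance(s_a, s_b):
    """Square root of the summed squared entry-wise differences."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(s_a, s_b)
                         for a, b in zip(row_a, row_b)))
```

For example, the distance between a 2 x 2 identity matrix and the matrix that swaps the two symbols is 2.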


We can calibrate our notion of “different enough” by 
picking an error rate at which the ASR system fails and 
measuring $D(\alpha, \mathrm{perfect})$ for various types of distortion at
that rate. We choose noise levels that yield an utterance error
rate of 85% ± 6%, with a symbol error rate of 34% ± 5%.
The confusion matrices at these levels were found to be distance
$D_{85} = 2.1 \pm 0.1$ from the perfect, error-free case. We 
assume that if the confusion matrices for any two distor- 
tions differ by this amount or more, an ASR system trained 
for one distortion will not be able to function on the other. 


In our experiments, the average distance between a pair 
of distortions (excluding the mixtures) is 2.0 ± 0.9, quite
close to $D_{85}$, so we would expect that, in general, an ASR
system could not be trained to handle two different types of
distortions simultaneously. The mixtures tend to be closer
to their components, a distance of 1.8 ± 0.6 away, but this is 
still far enough to conclude that an ASR system trained on 
either component of the mixture would probably not per- 
form well on the mixture itself. Consequently, it appears 
there are enough ways of generating fundamentally distinct 
distortions that the RTTs we have described would resist an 
attack using any single, well-trained ASR system. 


6. CONCLUSIONS 


In this paper, we have described our work towards building 
a speech-based Reverse Turing Test. We show that the gap 
between ASR and human performance is wide for a variety 
of noise effects, and that there are opportunities to exploit 
differences between the patterns of errors that humans and 
machines make. The RTT is a fundamentally new way of 
using speech synthesis and recognition technologies. 
The authors thank Olivier Siohan for assistance. 